
    Capturing and Analyzing the Execution Control Flow of OpenMP Applications

    An important aspect of understanding the behavior of applications with respect to their performance, overhead, and scalability characteristics is knowledge of their execution control flow. High-level knowledge of which functions or constructs were executed after which other constructs allows reasoning about temporal application characteristics such as cache reuse. This paper describes an approach to capture and visualize the execution control flow of OpenMP applications in a compact way. Our approach does not require a full trace of program execution events but is instead based on a straightforward extension of the summary data already collected by an existing profiling tool. In multithreaded applications each thread may define its own independent flow of control, which complicates both the recording and the visualization of the execution dynamics. Our approach allows full flexibility with respect to independent threads. However, in the most common usage models of OpenMP, threads operate in a largely uniform way, synchronizing frequently at sequence points and diverging only to operate on different data items in worksharing constructs. Our approach accounts for this by offering a simplified representation of the execution control flow for threads with similar behavior.
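
    A rough sketch of the summary-based idea follows. The hook and data structures here are illustrative assumptions, not ompP's actual code: it suffices to remember, per thread, only the previously entered construct and to count (predecessor, successor) pairs.

    ```cpp
    // Sketch: per-thread control-flow edge counting. The hook and data
    // structures are illustrative, not ompP's actual implementation.
    #include <map>
    #include <omp.h>
    #include <utility>

    enum { MAX_THREADS = 256 };

    // Last construct entered by each thread; region 0 is reserved for "start".
    static int last_region[MAX_THREADS];
    // Per-thread edge counts: (predecessor, successor) -> number of executions.
    static std::map<std::pair<int, int>, long> edges[MAX_THREADS];

    // Called by (hypothetical) instrumentation when a thread enters the
    // construct identified by region_id.
    void enter_region(int region_id) {
        int tid = omp_get_thread_num();
        edges[tid][{last_region[tid], region_id}]++;
        last_region[tid] = region_id;
    }

    // After the run, the (pred, succ, count) triples encode the control flow
    // without a full event trace; threads with identical edge maps can share
    // one representation, giving the simplified view for uniform threads.
    ```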

    DART-MPI: An MPI-based Implementation of a PGAS Runtime System

    A Partitioned Global Address Space (PGAS) approach treats a distributed system as if its memory were shared at a global level. Given such a global view of memory, the user can program applications much as on a shared memory system. This greatly simplifies the task of developing parallel applications, because no explicit communication has to be specified in the program for data exchange between different computing nodes. In this paper we present DART, a runtime environment which implements the PGAS paradigm on large-scale high-performance computing clusters. A specific feature of our implementation is the use of the one-sided communication of the Message Passing Interface (MPI) version 3 (i.e., MPI-3) as the underlying communication substrate. We evaluated the performance of the implementation with several low-level kernels in order to determine overheads and limitations in comparison to the underlying MPI-3.
    Comment: 11 pages, International Conference on Partitioned Global Address Space Programming Models (PGAS14).
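
    The MPI-3 one-sided substrate that such a runtime builds on can be sketched in a few lines. This shows the raw MPI calls only, not DART's API: each process exposes its part of a notionally global array through a window, and any process can read it without the target's participation.

    ```cpp
    // Minimal sketch of PGAS-style access on top of MPI-3 one-sided
    // communication. Illustrates the substrate only, not DART's own API.
    #include <cstdio>
    #include <mpi.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // Each process contributes one element of a "global" array.
        int* base;
        MPI_Win win;
        MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                         MPI_COMM_WORLD, &base, &win);
        *base = rank * 100;              // local part of the global array
        MPI_Barrier(MPI_COMM_WORLD);     // ensure all elements are written

        // Passive-target epoch: rank 0 reads every element without any
        // matching call on the target processes.
        if (rank == 0) {
            for (int target = 0; target < size; ++target) {
                int value;
                MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
                MPI_Get(&value, 1, MPI_INT, target, 0, 1, MPI_INT, win);
                MPI_Win_unlock(target, win);  // completes the transfer
                printf("element %d = %d\n", target, value);
            }
        }

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }
    ```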

    From reactive to proactive load balancing for task-based parallel applications in distributed memory machines

    Load balancing is often a challenge in task-parallel applications. Balancing problems are divided into static and dynamic: "static" means that some prior knowledge about load information is available and balancing is performed before execution, while "dynamic" must rely on partial information about the execution status to balance the load at runtime. Conventionally, work stealing is the practical approach used in almost all shared memory systems. In distributed memory systems, however, communication overhead can cause stolen tasks to arrive too late. To improve on this, a reactive approach has been proposed that relaxes communication in balancing load. The approach dedicates one thread per process to monitor the queue status and offload tasks reactively from a slow to a fast process. However, reactive decisions can be wrong in cases of high imbalance. First, this article proposes a performance model to analyze reactive balancing behavior and understand the bound leading to incorrect decisions. Second, we introduce a proactive approach to further improve task balancing at runtime. The approach likewise exploits task-based programming models with a dedicated thread. The main idea is that this thread not only monitors load; it also characterizes tasks and trains load prediction models by online learning. "Proactive" indicates offloading tasks before each execution phase, with an appropriate number of tasks sent at once to a potential victim (an underloaded, fast process). The experimental results confirm speedup improvements over the previous solutions in important use cases. Furthermore, this approach can support co-scheduling tasks across multiple applications.
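
    A minimal sketch of the proactive scheme, under simplifying assumptions (a single scalar task feature and a least-mean-squares update; the article's online-learned models are richer):

    ```cpp
    // Sketch of the proactive idea: an online-trained model predicts the load
    // of the next phase, and surplus tasks are offloaded before execution.
    // The single feature and LMS update are simplifying assumptions, not the
    // article's exact prediction model.
    #include <vector>

    struct OnlinePredictor {
        double w = 0.0, b = 0.0;   // linear model: time ~ w * feature + b
        double lr = 0.01;          // learning rate
        double predict(double feature) const { return w * feature + b; }
        void update(double feature, double observed) {  // LMS step per task
            double err = predict(feature) - observed;
            w -= lr * err * feature;
            b -= lr * err;
        }
    };

    // Before a phase: estimate this process's total load and offload the
    // excess to an underloaded victim in one batch, not task by task.
    int tasks_to_offload(const std::vector<double>& task_features,
                         const OnlinePredictor& model,
                         double avg_load_all_processes) {
        if (task_features.empty()) return 0;
        double predicted = 0.0;
        for (double f : task_features) predicted += model.predict(f);
        if (predicted <= avg_load_all_processes) return 0;  // we are a victim
        double per_task = predicted / task_features.size();
        return static_cast<int>((predicted - avg_load_all_processes) / per_task);
    }
    ```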

    A compiler for universal photonic quantum computers

    Photons are a natural resource in quantum information, and the last decade has shown significant progress in high-quality single-photon generation and detection. Furthermore, photonic qubits are easy to manipulate and do not require particularly strongly sealed environments, making them an appealing platform for quantum computing. With the one-way model, the vision of a universal and large-scale quantum computer based on photonics becomes feasible. In one-way computing, the input state is not an initial product state but a so-called cluster state. A series of measurements on the cluster state's individual qubits and their temporal order, together with a feed-forward procedure, determine the quantum circuit to be executed. We propose a pipeline to convert a QASM circuit into a graph representation named measurement graph (m-graph) that can be directly translated into hardware instructions on an optical one-way quantum computer. In addition, we optimize the graph using ZX-calculus before evaluating the execution on an experimental discrete-variable photonic platform.
    Comment: 8 pages, 6 figures.
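
    To make the m-graph idea concrete, the following sketch shows what such a representation might hold; the field names are illustrative guesses, not the paper's definitions: nodes are cluster-state qubits with measurement angles, edges record entanglement, and an explicit order fixes the temporal measurement sequence needed for feed-forward.

    ```cpp
    // Illustrative sketch of a measurement graph (m-graph) for one-way
    // computing. All field names are assumptions, not the paper's structure.
    #include <utility>
    #include <vector>

    struct MeasurementNode {
        double angle;     // measurement angle in the XY plane (radians)
        bool is_output;   // output qubits are left unmeasured
    };

    struct MGraph {
        std::vector<MeasurementNode> nodes;      // cluster-state qubits
        std::vector<std::pair<int, int>> edges;  // CZ entanglement links
        std::vector<int> measurement_order;      // temporal order; earlier
                                                 // outcomes feed forward into
                                                 // later measurement settings
    };
    ```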

    The calibration and evaluation of speed-dependent automatic zooming interfaces.

    Speed-Dependent Automatic Zooming (SDAZ) is an exciting new navigation technique that couples the user's rate of motion through an information space with the zoom level: the faster a user scrolls in the document, the 'higher' they fly above the work surface. At present, there are few guidelines for the calibration of SDAZ. Previous work by Igarashi & Hinckley (2000) and Cockburn & Savage (2003) fails to give values for the predefined constants governing their automatic zooming behaviour. The absence of formal guidelines means that SDAZ implementers are forced to adjust the properties of the automatic zooming by trial and error. This thesis aids calibration by identifying the low-level components of SDAZ. Base calibration settings for these components are then established in a formal evaluation recording participants' comfortable scrolling rates at different magnification levels. To ease our experiments with SDAZ calibration, we implemented a new system that provides a comprehensive graphical user interface for customising SDAZ behaviour. The system was designed to simplify future extensions: new components such as interaction techniques and methods to render information can easily be added with little modification to existing code. This system was used to configure three SDAZ interfaces: a text document browser, a flat map browser, and a multi-scale globe browser. The three calibrated SDAZ interfaces were evaluated against three equivalent interfaces with rate-based scrolling and manual zooming. The evaluation showed that SDAZ is 10% faster for acquiring targets in a map than rate-based scrolling with manual zooming, and 4% faster for acquiring targets in a text document. Participants also preferred automatic zooming over manual zooming. No difference in acquisition time or preference was found for the globe browser. However, for all interfaces participants commented that automatic zooming was less physically and mentally draining than manual zooming.
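
    The core SDAZ coupling itself fits in a few lines. The constants below are placeholders of the kind the thesis calibrates, not its recommended values:

    ```cpp
    // Sketch of the speed-dependent automatic zooming coupling: the faster
    // the scroll rate, the further the view zooms out. The constants are
    // placeholder calibration parameters, not values from the thesis.
    #include <algorithm>
    #include <cmath>

    double zoom_for_speed(double scroll_speed /* document units per second */) {
        const double kSpeedThreshold   = 50.0;  // below this, no automatic zoom
        const double kZoomPerSpeed     = 0.01;  // magnification drop per unit speed
        const double kMinMagnification = 0.1;   // never zoom out beyond 10%

        double excess = std::max(0.0, std::fabs(scroll_speed) - kSpeedThreshold);
        return std::max(kMinMagnification, 1.0 - kZoomPerSpeed * excess);
    }
    ```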

    Programming Abstractions for Data Locality

    The goal of the workshop and this report is to identify common themes and standardize concepts for locality-preserving abstractions for exascale programming models. Current software tools are built on the premise that computing is the most expensive component, but we are rapidly moving to an era in which computing is cheap and massively parallel while data movement dominates energy and performance costs. In order to respond to exascale systems (the next generation of high performance computing systems), the scientific computing community needs to refactor its applications to align with the emerging data-centric paradigm. Our applications must be evolved to express information about data locality. Unfortunately, current programming environments offer few ways to do so. They ignore the incurred cost of communication and simply rely on hardware cache coherency to virtualize data movement. With the increasing importance of task-level parallelism on future systems, task models have to support constructs that express data locality and affinity. At the system level, communication libraries implicitly assume that all processing elements are equidistant from each other. In order to take advantage of emerging technologies, application developers need a set of programming abstractions to describe data locality for the new computing ecosystem. The new programming paradigm should be more data-centric and allow developers to describe how to decompose and how to lay out data in memory. Fortunately, there are many emerging concepts for managing data locality, such as constructs for tiling, data layout, array views, task and thread affinity, and topology-aware communication libraries. There is an opportunity to identify commonalities in strategy that enable us to combine the best of these concepts into a comprehensive approach to expressing and managing data locality on exascale programming systems. These programming model abstractions can expose crucial information about data locality to the compiler and runtime system to enable performance-portable code. The research question is to identify the right level of abstraction, which may involve techniques ranging from template libraries all the way to completely new languages.
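
    As a small example of the kind of construct the report surveys, consider a tiled view over a flat array; the class below is an illustrative sketch, not any particular library's API:

    ```cpp
    // Sketch of a tiling abstraction: iterating tile by tile keeps each block
    // cache-resident before moving on, making data locality explicit in code.
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    template <typename T>
    class TiledView {
    public:
        TiledView(std::vector<T>& data, std::size_t rows, std::size_t cols,
                  std::size_t tile)
            : data_(data), rows_(rows), cols_(cols), tile_(tile) {}

        // Apply f to every element, one tile at a time.
        template <typename F>
        void for_each_tile(F f) {
            for (std::size_t ti = 0; ti < rows_; ti += tile_)
                for (std::size_t tj = 0; tj < cols_; tj += tile_)
                    for (std::size_t i = ti; i < std::min(ti + tile_, rows_); ++i)
                        for (std::size_t j = tj; j < std::min(tj + tile_, cols_); ++j)
                            f(data_[i * cols_ + j]);
        }

    private:
        std::vector<T>& data_;
        std::size_t rows_, cols_, tile_;
    };
    ```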

    OpenMP Application Profiling – State of the Art and Directions for the Future

    OpenMP is a successful approach to writing threaded parallel applications. This article describes the state of the art in performance profiling OpenMP applications, covering vendor performance tools and platform-independent techniques. The features of the OpenMP profiler ompP are described in detail, and an outlook on future directions in this area is given.
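
    One overhead class such a profiler reports can be reproduced by hand in a few lines: the time a thread waits at the implicit barrier of a worksharing construct is pure load-imbalance overhead. This hand-instrumented sketch only illustrates the measurement; a tool like ompP inserts it automatically:

    ```cpp
    // Hand-instrumented sketch of one overhead class an OpenMP profiler
    // reports: per-thread wait time at the implicit barrier of a worksharing
    // construct (load imbalance). A real tool adds this instrumentation itself.
    #include <cstdio>
    #include <omp.h>

    void profiled_loop(double* a, int n) {
        #pragma omp parallel
        {
            #pragma omp for nowait
            for (int i = 0; i < n; ++i)
                a[i] *= 2.0;                       // the measured loop body
            double work_end = omp_get_wtime();     // this thread's share done

            #pragma omp barrier                    // re-create implicit barrier
            double barrier_end = omp_get_wtime();
            // (barrier_end - work_end) is this thread's imbalance overhead.
            printf("thread %d waited %.6f s\n",
                   omp_get_thread_num(), barrier_end - work_end);
        }
    }
    ```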

    OpenMP-centric Performance Analysis of Hybrid Applications

    Several performance analysis tools support hybrid applications. Most originated as MPI profiling or tracing tools, and OpenMP capabilities were added later to extend the performance analysis capabilities to the hybrid parallelization case. In this paper we describe our experience with the opposite path to supporting both programming paradigms. Our starting point is a profiling tool for OpenMP called ompP, which was extended to handle MPI-related data. The measured data and the method of presentation follow our focus on the OpenMP side of the performance optimization cycle. For example, the existing overhead classification scheme of ompP was extended to cover time in MPI calls as a new type of overhead.
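
    The extension can be pictured as applying ompP-style timing to MPI calls as well, so that time in MPI shows up as one more overhead class. The wrapper below is a sketch of the idea, not ompP's implementation:

    ```cpp
    // Sketch: booking time spent in MPI calls as an additional overhead
    // class, analogous to the existing OpenMP overhead classes. The wrapper
    // is illustrative; a real tool would intercept MPI via the PMPI interface.
    #include <mpi.h>

    static double mpi_time = 0.0;  // accumulated "time in MPI" overhead

    static int timed_allreduce(const void* sendbuf, void* recvbuf, int count,
                               MPI_Datatype type, MPI_Op op, MPI_Comm comm) {
        double t0 = MPI_Wtime();
        int rc = MPI_Allreduce(sendbuf, recvbuf, count, type, op, comm);
        mpi_time += MPI_Wtime() - t0;  // attributed as MPI overhead
        return rc;
    }
    ```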

    Distributed application monitoring for clustered SMP architectures

    Performance analysis for terascale computing requires a combination of new concepts, including distribution, on-line processing, and automation. As a foundation for tools realizing these concepts, we present a distributed monitoring approach for clustered SMP architectures that tries to minimize the perturbation of the target application while retaining flexibility with respect to filtering and processing of performance data. We achieve this goal by dividing the monitor into a passive monitoring library linked to the application and an active component called the runtime information producer (RIP), which provides performance data (metric- and event-based) for individual nodes. Instead of adding an additional layer in the monitoring system that integrates performance data from the individual RIPs, we include a directory service as a third component in our approach. By querying this directory service, tools discover which RIPs provide the data they need.
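
    The three-component split can be sketched as follows; all names and interfaces are illustrative, not the system's actual code: RIPs register what they can supply with the directory service, and tools query it directly instead of going through an extra integration layer.

    ```cpp
    // Sketch of the monitoring architecture: runtime information producers
    // (RIPs) register what they provide with a directory service, and tools
    // query the directory to find the right RIP. All names are illustrative.
    #include <map>
    #include <string>
    #include <vector>

    struct RipEntry {
        std::string node;      // SMP node this RIP runs on
        std::string endpoint;  // where a tool can contact the RIP
    };

    class DirectoryService {
    public:
        // A RIP announces that it provides `metric` for its node.
        void register_rip(const std::string& metric, const RipEntry& rip) {
            providers_[metric].push_back(rip);
        }
        // A tool asks which RIPs can deliver the data it needs.
        std::vector<RipEntry> lookup(const std::string& metric) const {
            auto it = providers_.find(metric);
            return it == providers_.end() ? std::vector<RipEntry>{} : it->second;
        }
    private:
        std::map<std::string, std::vector<RipEntry>> providers_;
    };
    ```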